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Abstract 

An empirical Bayes problem has an unknown prior to be estimated from data. 
The predictive recursion (PR) algorithm provides fast nonparametric estimation of 
mixing distributions and is ideally suited for empirical Bayes applications. This 
£^ . paper presents a general notion of empirical Bayes asymptotic optimality, and it 



is shown that PR-based procedures satisfy this property under certain conditions. 
As an application, the problem of in-season prediction of baseball batting averages 
is considered. There the PR-based empirical Bayes rule performs well in terms of 
prediction error and ability to capture the distribution of the latent features. 



Keywords and phrases: Batting average; compound decision problem; density 
I> ' estimation; high-dimensional; mixture model. 

CO 

in; 1 Introduction 

o: 

In large-scale inference problems, the work of Stein suggests that component- wise optimal 
procedures are typically sub-optimal in the simultaneous inference problem. The common 
theme in all works related to simultaneous inference is a notion of "borrowing strength" — 
using information about all cases for each compo nent problem. An important exa mple is 



the false discovery rate controlling procedure of iBenjamini and Hochbergl (119951 ) which 
uses the data itself to determine the critical region for the sequence of tests. Shrinkage 
rules, penalized estimation, and hierarchical Bayes inference all can be given a similar 
"information sharing" interpretation. 

One interesting approach to simultaneous inference is empirical Bayes, where a fully 
Bayesian model is assumed but, rather than elicitation of subjective priors or construction 
of non-informative objective priors, one uses the data itself to estimate the prior. Para- 
metric empirical Bayes, where a parametric form is ass umed for the unknown prior, has 



been given considerable attention in the literature; see lEfronl (120101 ) and the references 
therein. When the number of cases is relatively small, the parametric approach is most 
reasonable. Indeed, for Robbins' brand of nonparametric empirical Bayes to be success- 
ful, a tremendously large number of cases are needed. But high-dimensional inference 
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problems are now commonplace in statisti cal applicati ons, so nonparametric empirical 
Bayes is now a promising area of research. lEfronl (120031 . p. 369) writes 



What was unimaginable [then] is commonplace today. Nonparametric empir- 
ical Bayes applies in an almost off-the-shelf manner to microarrays. 

Theoretical analysis of empirical Bayes procedures looks at the limiting properties of 
the corresponding risk. After a description of the decision problem and empirical Bayes 
approach in Sections [2H3l I propose an apparently new notion of asymptotic optimality. 
Here I say that an empirical Bayes rule is asymptotically optimal if its risk (a function of 
observable data) converges almost su rely to the Bay es risk. Compare this to the classical 
definition of asymptotic optimality in lRobbinsi ( 119641 ) based on convergence in mean of the 
empirical Bayes risk. While neither definition is mathematically stronger than the other, 
I believe there is a considerable difference from a statistical point of view. In particular, 
convergence in mean is not especially meaningful to a Bayesian who does not believe in 
averaging risk over the sample space. Theorem [1] gives a set of sufficient conditions for 
asymptotic optimality in this apparently new almost-sure sense. 

To implement nonparametric empirical Bayes, one needs a nonparametric estimate 
of the prior/mixing distribution. This, in itself, is a challenging theoretical and compu- 
tational problem. The most popular techniques are based on nonparametric maximum 
likelihood and kernel estimators. Two recent references on these in the context of em- 



pirical Bayes inference are iBrown and Greenshteinl ( 120091 ) and iJiang and Zhangl (]2009[ ). 
But these methods can be computationally expensive and they are primarily focused on 
the Gaussian location problem. A promising alternative is the predictive recursion (PR) 
algorithm, designed for fast nonparametric esti mation of mixing distr ibut ions in arbitrar y 
mixture model problems, not only Gaussian; see iNewton et aD fjl99«h and lNewtonl (12002^ 1 . 
PR seems ideally suited for the empirical Bayes problem for, given the PR estimate, a 
plug-in empirical Bayes estimate of the optimal Bayes rule is immediately available. 

Performance of the PR-based empirical Bayes rule depends on convergence properties 
of the estimates produced by PR, and a fa irly detailed picture of PR 's convergence prop- 
erties is now available. For finite mixtures, iGhosh and Tokdarl (I2006[) proved conv ergence 
of PR under strong conditions on the mixture kernel; iMartin and Ghoshl ( 2008 ) extend 
this result using tools from stochastic approximation theory; and iMartinl (|2012bl ) estab- 
lished a nearly root-n rate of co nvergence. The gene ral case, described in more detail in 
Section [5], was first attacked bv iTokdar et al.l (120091 ). They showed that, under suitable 
conditions, the PR estimates of the mixing and mixture dist ributions are both strongly 
consistent in the weak- and Li-topologies, respectively. Later, IMartin and Tokdarl (120091 ) 
established convergence properties of the PR estimates under model mis-specification, 
and also gave a bound on the rate of convergence. 

In Section [5j I use the known convergence theory for PR together with Theorem [1] 
to show that the PR-based empirical Bayes rules are asymptotically optimal, under cer- 
tain conditions, in hypothesis testing and point estimation problems. Section [6] contains 
a comparison of the PR-based empirical Bayes rules with several other parametric and 
nonparametric empirical Bayes rules in an interesting example of predicting batting av- 
erages in major league baseball. It turns out that the PR-based rule is competitive with 
the others in the prediction problem, but is more flexible and gives a realistic picture of 
the distribution of latent hitting abilities. Martin and Tokdarl (120121 ) make a similar con- 
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elusion concerning the potential of PR-based empirical Bayes in the large-scale multiple 
testing applications. These results together suggest that PR-based empirical Bayes is a 
promising alternative to existing methods and worthy of further investigation. 



2 The decision problem 
2.1 Basic definitions 

The general decision problem has several components. First is parameter space that 
contains the unknown quantity of interest 9, often called the "state of nature." Second 
is an action space A, containing all possible actions, or decisions, a. Third, there is 
a loss function L(a, 9) > that represents the penalty for taking action a when the 
state of nature is 9. Finally, there is observable data Y taking values in a measurable 
space (Y, ( 3 r ), equipped with a cr-finite measure \i. When the state of nature is 9, the 
sampling distribution of Y, taking values in Y, is Pg and its density is po = dPg/dfi. 
In the theoretical analysis that follows, I shall take each of these components as given. 
However, these components themselves — particularly the loss function L(a, 9) and the 
model pg — are often quite difficult to elicit in pra ctice. For this reas on, there has been 



extensive work on loss and model robustness (e.g., iGhosh et al.ll2006l . Sec. 3.10-3.11). 

With these four components in place, I can now describe the statistical decision prob- 
lem. When data Y = y is observed, action 8(y) £ A is taken. Action 5(y) is called a 
decision rule. Then the average loss, or risk, of decision rule 5 when 9 £ G is the true 
state of nature is defined as 

R{6,6) = f L(5(y),9)p e (y)dfi(y). 

For each decision rule S there is a risk function R(8, •), and the goal of non-Bayesian 
decision theory is to choose the decision rule 5 whose risk function R(S, •) is the "smallest" 
in some sense. Often there is no such rule S which gives a uniformly smallest risk function; 
in such cases, co ncessions must be made by im posing; certain constraints, like unbiasedness 



or equivariance (ILehmann and Casellalll998h . 



2.2 Bayesian decision theory 

In the Bayesian decision problem, there is an additional piece of input required — a prior 
distribution for 9. Equip with an appropriate a- algebra SS and let F be a probability 
measure defined there. On the product space (Yx6,^ f&BS), define a probability measure 
by the density pe(y) dF{9) dfi(y). Two quantities related to this joint distribution are the 
marginal for Y, namely, 

PF(y)= [ Pe(y)dF(9), 



and the conditional distribution of 9 given Y — y, described by Bayes' formula, 

dF(9\y) = {p e (y)/p F (y)}dF(9). 

When the prior F is known, there is a well-developed Bayesian decision theory, de- 
scribed next. On the other hand, when F is unknown, as is often the case in practice, 



3 



some special considerations are needed; see Section [3] When F is known, define the Bayes 
risk of a decision rule 5 to be the average risk R(5, 9) as 9 various according to the prior 
F; in symbols, 



R(S,9) dF(9). 



The Bayesian decision-theorist seeks the decision rule 5 = 8p that minimizes the Bayes 
risk p(5, F). I will write p(F) = p(5f, F) for this minimal Bayes risk. Below I discuss the 
two most common decision problems: hypothesis testing and point estimation. 

The general hypothesis testing problem considers H : 9 £ O versus Hi : 9 ^ O , 
where O C has positive prior probability, i.e., -F(0 O ) > 0. Here the action space 
is A = {a ,cti} where a« = "choose hypothesis i" . A Type I error is choosing a\ when 
Hq is true, and a Type II error is choosing ao when Hi is true. A typical loss function 
in such testing problem is given by L(ai,9) = Kil® {9) and L(a ,9) = /t 2 (l — Ie (9)), 
where k±, k 2 are finite positive numbers representing the cost of a Type I, Type II error, 
respectively. The corresponding risk function is then a linear combination of the Type I 
and Type II error probabilities. The Bayes rule is given by 



S F (y) = 



if F(Q 
if F(Q 



y)> r 
y) <r 



where -F(0 O | y) is the posterior probability for O , given Y = y, an d r = k?J(k-\ + k 2 ) is 
the relative cost of a Type II error. These details are given in Berger f 19841 . pp. 163-164). 
It is interesting that, for a point-null H : 9 = 9 , the quantity F({9 } \ y) is exactly 
the local false discov ery rate that has appeared fairly recently in the large-sca le multiple 
testing context (e.g., iMartin and TokdaJlioii ; ISun and Caill2007t lEfronlboioh . 

For the estimation problem, I shall assume 9 is the estimand, so that A = 0. The 
most common loss function in such problems is square-error loss, i.e., L(a, 9) = \\a — 9\\ 2 , 
but other losses can be handled similarly. For square-error loss, the Bayes rule is 
the posterior mean of 9 given Y = y, i.e., 5f(v) = J e 9 dF{9 \ y). 



3 Empirical Bayes 

3.1 Setup, motivation, and classical developments 

In the previous section, there was a single observation Y (not necessarily real-valued) 
and a corresponding single parameter 9 (also not necessarily real- valued) . Corresponding 
hierarchical model for Y is as follows: 

Y\9~ Pe ( y ) and 9 ~ F, (1) 

In this case, very little can be done when F is unknown; indeed, Y provides information 
about just a single observation from F which, in turn, contributes nothing to one's lack 
of knowledge about F. However, nowadays, there are applications which can be modeled 
by many samples from the hierarchical model (pQ). Specifically, pairs (Yi, 9i), . . . , (Y n , 9 n ) 
are sampled independently from the joint (Y,9) distribution in ([!]), but only the Y's 
are observed. DNA microarray tec hnologies and the r elated statistical problems spurred 
much of the growth in this lEfronl d2008l . l2010h . This model has two key features: 
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• the number of cases n is typically very large, say tens of thousands; 

• the cases are parallel in the sense that each Y has a corresponding (latent) 9i which 
is an independent copy of the single 9 seen in the previous sections. 

Together, these two features provide the following intuition: by treating the observed data 
(Yi, . . . , Y n ) as a proxy for the unobserved parameters (9\, . . . , 9 n ), a large independent 
sample from F, it should be possible to estimate F empirically. 

The canonical high-dimensional model is the normal mean model, i.e., Yj \ 9i ~ 
N(6 l j, 1), i = 1, . . . ,n. This seemingly simple mod el has given rise to many fundamen- 
tal developments in statistics. Indeed, ISteinl ( 1l98ll ) showed that the high-dimensionality 
alone is cause for statisticians to rethink their approach. The fundamental idea be- 
hind modern approaches to high-dimensional problems is that inference can be improved 
by sharing information between cases, and frequentists and Bayesians alike have in- 
corporated this idea into their respective analyses; e.g., FD R controlling procedure s 
( jBenjamini and Hochberglll995l ) and hierarchical Ba yes methods (IScott and Bergerll2006l ). 



The empirical Bayes approach (e.g., IRobbinsI Il964f ) falls somewhere in between the fre- 
quentist and Bayesian extremes. It starts with a Bayesian model and uses the observed 
data Yi, . . . , Y n to estimate the prior. This easily and naturally facilitates the sharing of 
information betwe en case s . Par ametric empirical Bayes methods have received consider- 
able attention; 
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Efronl ( 120101 ) and the references therein. The James-Stein estimator 
is a classical example, where (#i, . . . , 9 n ) is assigned a Gaussian prior with variance es- 
timated from the data. But the very-high- dimension of modern problems suggests that 
the more robust nonparametric empirical Bayes methods might be successful. 



3.2 Robbins' nonparametric empirical Bayes 



In the high-dimensional case, with n large, i t may not be necessary to impose parametric 
constraints on the unknown prior. iRobbinsI ( 119641 ) considered nonparametric estimation 
of the prior F based on Y 1; . . . , Y n . With an estimate F n of F, the Bayes rule 5p can be 
replaced by a plug-in estimate 5 n = 5^ to be used in a future decision problem. 

Definition 1. Let F n be an estimate of F based on data Y±, . . . ,Y n from the model 
(TTJ). Define 5 n = 5p to be the decision rule obtained by plugging in F n for the true F in 



the Bayes rule 5p. Then p n (F) 
future decision problem. 



p(S n ,F) represents the risk incurred by using S n in a 



The decision rule 5 n in Definition [T] is called an empirical Bayes rule and p n {F) the 
corresponding e mpirical Bayes r isk. It is important to note that Definition [TJ is not the 



same as that of IRobbinsI (119641 ) and others; this classical risk involves an expectation 
over the observed data sequence. Therefore, Robbins' empirical Bayes risk is a number 
whereas p n (F) is a random variable. 

Remark 1. The decision-theoretic formulation of the empirical Bayes problem given 
above is based on minimizing the risk in a future decision problem. In practice, however, 
we are often interested in the "compound problem" of making decisions about 9i, . . . , 9 n 



Samuel 


(1967 


) andlCopas ( 


1969) 
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rule for the compound problem is to apply the Bayes rule for a future decision to each 
component problem. Therefore, the natural approach that is typically used in high- 
dimensional applications is to apply the resulting empirical Bayes rule for a future decision 
to each component problem. See Section 



4 Asymptotic optimality 

By definition, p n (F) > p(F) almost surely. But, as n — >■ oo, we have more data with 
which to construct F n , so we might expect to be able to get close to the Bayes risk 
asymptotically. It is in this regard that we measure the performance of 8 n . 

Definition 2. Let F be a given collection of probability measures, assumed to contain 
the true prior F. A sequence of decision functions 5 n is asymptotically optimal relative 
to F if p n (F) ->■ p(F) almost surely for all F G F. 

Asymptotic optimality in Defintion |2] is different than that of Robbins. Indeed, Rob- 
bins' asymptotic "E-optimality" includes an additional expectation over the data sequence 
Y]_, Y 2 , .... While asymptotic optimality need not imply asymptotic E-optimality, the dif- 
ference is important from a statistical point of view: the former means that, for large n, 
the decision procedure has small risk for (almost) every data sequence, whereas the latter 
means the decision procedure does well only on average. Clearly, asymptotic E-optimality 
means very little to a Bayesian who does not believe in averaging over Y. 

Next is a general theorem on asymptotic op timality, similar to that for asymptotic 
E-optimality found in iDeely and Zimmer fll976h . 



Theorem 1. For F G F, assume that 5 n (y) — > 5f(v) almost surely for p-almost ally, 
that L(S n (y), 9) — > L(b~ F {i)), 9) almost surely for (p x F) -almost all (y, 9), and that there 
exists a sequence of integrable functions h n (y, 9) = h n (y, 9; Yy, . . . , Y n ) such that 

• h n (y, 9) — > h{y, 9) almost surely for {p x F) -almost all (y, 9), 

• L(5 n (y),9) < h n (y,9) almost surely for alln and for {p x F) -almost all (y,9), and 

• J Y Is h n{y, 9)p e {y) dF{9) dp(y) ->■ f Y f e h(y, 9)p 9 (y) dF(9) dp(y) < oo almost surely. 

Then 5 n is asymptotically optimal relative to F. 

Proof. The proof is a simple appli cation of the dominated convergence theorem or, 



more specifically, the main theorem of iPrattl ( 119601 ). Write 




limp n (F)= hm / / L(5 n (y),9)p e (y)dF(9)dp(y). (2) 



It remains to show that, with probability 1, limit and integration can be interchanged. 
Let be the appropriate u-algebra on Y°° and let P^" be the distribution measure 
of Yi,Y 2 , . . .. There are five "P^-almost surely" assumptions made in the theorem: one 
about 5 n , one about the loss L(5 n , 9), and three about h n . Let A±, . . . , A 5 G denote 
the events where these respective assumptions are true. By assumption, Pp'(Ai) = 1, for 
% = 1, . . . , 5. For any data sequence in A± l~l • • • D A$, interchange of limit and integration 
in fl2]) holds by Pratt's theorem. The claim follows since P™(Ai D • • • D A 5 ) = 1. □ 
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The assumption that the loss converges can be easily checked in practice. For example, 
to estimate a real 9, the loss L(a, 9) is typically a continuous function of the action 
(estimate) a and the parameter 9 such as L(a, 9) = (a — 9) 2 . In other problems, such as 
hypothesis testing, the action space A has only a finitely many elements and the desired 
loss convergence obtains in all but the strangest of cases. 



5 Nonparametric estimation of the prior F 

5.1 Predictive recursion 

Robbins' nonparametric empirical Bayes analysis requires a nonparametric estimate of 
the prior F. There are a variety of methods available for this task, e.g., nonparametric 
maximum likelihood, deconvolution, etc. Here I focus on a relatively new method, namely 
predictive recursion. It is interesting t hat the predictive recurs ion (PR) algorithm boils 
down to a stochastic a pproximation ( Marti n and Ghoshl l2008h . one of Robbins' other 
great contributions (see Robbins and Monrolll951 : Laill2003f h 

PR is a fast, stochastic algorithm for estimating mixing distributions in nonparametric 
mixture models. PR's original motivation was as a computationally efficient alternative to 
Markov chain Monte Carlo method s in fitting Bayesian Dirichlet process mixture models 
( iNewton et al.lll998l ; lNewtonll2002l ). If, or to what extent, the PR estimates approximate 
the Bayesian estimates in a Dirichlet process mix ture model remains an open question; 
however, simulations and theoretical arguments in iTokdar et al.l ( 120091 ) indicate that PR 
is indeed an attractive alternative. 

Let be the marginal distribution of the individual Yj's, having density Pf(v) = 
J Pe{y) dF(9) with respect to \x. For observations Y±, . . . , Y n from Pp, the PR algorithm 
for nonparametric estimation of F and pf is as follows. 



PR Algorithm. Choose a starting value Fq to initialize the algorithm, and a sequence 
of weights {wi : i > 1} C (0, 1). For i = 1, . . . , n, repeat 



Pi-i(y) = J Pe(y)dFi(9), 

dFi(9) = (1 - Wi) dF t ^(9) + w iP e(Xi) dF^{9) / Vi(^) • 



(3) 
(4) 



Produce F n and p n = Pf„ as the final estimates of F and P p, respectively. 

An important property of PR is its speed and ease of implementation. Also, PR has 
the unique ability to estimate a mixing distribution F which is absolutely continuous 
with respect to any user-specified dominating a- finite measure v on 0. Indeed, it is easy 
to see that F n dominated by Fq for all n. Therefore, if Fq has a density with respect to 
v, then so does F n . C ompare this to the nonparametric maximum likelihood estimate 
which is a.s. discrete (ILindsayl 11995). This prop e rty is particularly important in the 
multiple testing application in iMartin and Tokdarl ( 120121 ). as ident inability of the model 
parameters requires a careful handling of the underlying dominating measure. 

I should also point out that the PR estimates F n and p n depend on the order in which 
the data enter the algorithm. This dependence is typically weak, especially for large n, 
but to remove this dependence, it is standard to average the PR estimates over several 
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randomly chosen data permutations; see Section [6j iTokdar et all ( 120091 ) make a formal 
case, based on Rao-Blackwellization, for averaging PR over permutations. 

A sum mary of PR's convergence properties was given in Section [TJ Here I state a 
theorem of iMartin and Tokdarl (120091 ) which describes the behavior of F n and p n in the 
case where is not necessarily finite. This result will be used to establish asymptotic 
optimality of PR-based nonparametric empirical Bayes rules in Section 15.21 Let F be 
(a subset of) the set of probabilities measures F on 6. For densities p and p' on Y, let 
K{p,p') = J \og(p/p')pd/j, be the Kullback-Leibler divergence of p' from p. Consider the 
following set of assumptions. 

Al. The set F of candidate F's is precompact in the weak topology. 

A2. 9 i — y pe(y) is bounded and continuous for /i-almost all y. 

A3. The PR weights (w n ) C (0, 1)°° satisfy J2 n w n = oo and J2 n w n < °°- 

A4. There exists C < oo such that sup 01 &2 03 f (pe 1 /pe 2 ) 2 pe 3 d\i < C . 

A5. Identifiability: If pp = Pf> /i-almost everywhere for some F, F' £ F, then F — F' . 

A6. For any e > and any compact Y' C Y, there exists a compact 0' C such that 
J Y ' Pe(y) d^{y) < e f° r a11 # e 0'. 



Theorem 2 (IMartin and Tokdarl 120091 ). Under A1-A4, K(p F ,p n ) -» P F -almost 
surely. Furthermore, under A1-A6, F n — >• F in the weak topology, Pp-almost surely. 



Remark 2. IMartin and Tokdarl (120091 ) discuss the conditions and ways they can be 
relaxed. Condition A4 is the strongest, but it holds generally for exponential families 
whose sufficient statistic has bounded moment-generating function on 0. 

Remark 3. The PR weights are often taken a s w n = (n + 1)~ 7 for some 7 £ (1/2, 1], 
which clearly satisfies A3. If 7 £ (2/3,1], then IMartin and Tokdarl (120091 ) establish a 



o(n 
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bound on the Kullback-Leibler rate of convergence. 



A generalization of the nonparametric mixture model Yi, . . . , Y n ~ PFiy) is the semi- 
parametric problem where, in addition to the unknown prior / mixing: distribution F , there 
is a finite-dimensional parameter u to be estimated as well. IMartin and Tokdarl (120 111 ) 
propose an extension of the PR algorithm for simultaneous estimation of (F,u), based 
on the interesting construction of a PR-based likelihood function for u. They show that 
this PR-based likelihood function approximates the marginal likelihood under a Bayesian 
Dirichlet process m i xture mo del. Applicatio ns of this methodology can be found in 
Martin and Tokdarl (120121 ) and IMartin! (l2012ah . 



5.2 Nonparametric empirical Bayes via PR 

The advantage of Robbins' brand of nonparametric empirical Bayes is that, once an 
estimate F n of F is available, the inference problem is straightforward. That is, one 
simply finds the Bayes rule 5p that depends on the unknown F, and then replaces that 
with 5 n = 5F n . PR seems to be ideally suited to this problem. The available asymptotic 
theory for the PR estimate F n in Theorem [2] will be applied, along with Theorem [H to 
prove that the PR-based plug-in nonparametric empirical Bayes rule is asymptotically 
optimal in the sense of Definition [2j 
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Start with the hypothesis testing problem described in Section 12.21 If F n and p n are 
estimates of F and pp based on the PR algorithm, then the corresponding empirical 
Bayes rule 5 n {y) = S Fn (y) is given by 

5 n (y) = h iiF ^\y)>r 

nVy) \ax ifF n (0 o |2/)<r 1 ' 

We now prove the following asymptotic optimality result. 

Proposition 1. If Pq is a continuous distribution, L(a,8) is as described in Sec- 
tion \2.Bf and the conditions of Theorem [H hold, then 5 n in (jSJ) is asymptotically optimal 
with respect to F in the sense of Definition [H 

Proof. Under the conditions of Theorem [21 it is clear that F n (Q | y) — > F(Q \ y) 
almost surely for all y. The continuity assumption implies the true posterior probability 
F(Oo I v) IS °ff the threshold r with probability 1, so it then follows that S n (y) — > Sp(y) 
almost surely for each y. Since the loss L(a, 9) is uniformly bounded, the choice h n (y, 9) = 
sup a d L(a,9) in Theorem [1] shows that 5 n is asymptotically optimal. □ 

Things are a bit more challenging in the estimation problem in that more conditions 
are required to establish asymptotic optimality of the PR-based empirical Bayes rule. 
Suppose, for example, that CR and L(a,9) = (a — 6 1 ) 2 , square-error loss. Then the 
Bayes rule is the posterior mean and, hence, the PR-based empirical Bayes rule is 



S n (y) = / 6dF n (6 | y) = —— / 6 Ve {y)dF n {6). 
Jo Pn{y) Je 

Notice that conditions of Theorem [2] are not enough to conclude S n (y) — > Sp(y) a.s. for 
each y. For this to follow, we need 9 >->■ Ope(y) to be bounded for each y; this is satisfied 
if, for example, pe is a N(#, 1) density. Since the loss is unbounded in general, finding a 
bounding sequence h n (y,9) as in Theorem [1] must be done carefully case-by-case. How- 
ever, a general optimality result holds under the extra condition that G and, hence, A 
are compact. This is not really a restriction, in this case, since verifying condition Al in 
Theorem [2] usually requires to be compact anyway. 

Proposition 2. If L(a,6) is bounded on A x ; 9 i— > 9po(y) is bounded for each y, 
and the conditions of Theorem^ hold, then the PR-based empirical Bayes rule 5 n (y) is 
asymptotically optimal in the sense of Definition^ 

Proof. Take h n (x,9) = sup a6 ,L(a, 9) and apply Theorem [TJ □ 

6 Baseball example 

6.1 Model, data, and objectives 

Empirical Bayes analysis of hitting per formance in majo r league bas e ball has been a 



recurring theme in t he lite rature, e.g.. lEfron and Morris! ( 119731 . Il975l ). iBrownl (120081 ) 



Muralidharanl (120101 ) . and Ijiang and Zhangl (j2010l ). In these papers, focus has been 
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on using data on each players' batting performance in the first half of the season to 
simultaneously predict their batting performance in the second half of the season. Due 
to the large number of players in consideration (roughly 500 in the analysis that follows), 
prediction is improved by pooling information across the different players. Empirical 
Bayes is a particularly convenient way to perform this information sharing. 

The model setup is as follows. In the first half of the season, Player i, % = 1, . . . , n, has 
rii at-bats, and his number of hits Yi is modeled as Yi ~ Bin(rii,#j), where 9i represents 
Player z's latent hitting ability. This is an unrealistic setup (for a variety of reasons), but 
makes for a relatively simple analysis. The goal is first to estimate (#!,..., 9 n ) based on 
data for all n players from the first half of the season. Then these estimates will be used 
to generate a prediction for the second half hitting performance, and the performance of 
the estimation procedure will be judged by how well the method predicts. 

The data consists of batting records for each player involved in the 2005 major league 
baseball season. In Brown's study, he splits the data into first and second half statistics — 
these are the "training" and "testing" portions. Some players had insufficient number 
of at-bats, and were removed from the sample. So the essentially both training and 
testing portions contain data for the same players; the only caveat is that a few players 
with sufficient number of at-bats in the first half but an insufficient number in the sec- 
ond half (perhaps due to injury). Brown also introduces a suitable variance-stabilizing 
transformation to take the original binomial data to approximately normal data, so that 
the standard procedures (e.g., James-Stein) can be easily applied. Specifically, the new 
model is Xi ~ N(£j, l/4nj) (approximately), for i — 1, . . . , n, where = arcsin y/dl, and 
the goal is to simultaneously estimate (£i, . . . , £ n ) based on the first half data and then 
gi ve a prediction of the observed (X[, . . . , X' n ) in the second half. The reader is referred 



to 



Brown! (120081) for details on the variance-stabilizing transformation [eq uation ( 2 .2) in 



Brown! ( 20081 . p. 121)] and prediction error calculations [expression tse in Brown ( 2008 



p. 126)]; suffice it to say that small prediction error is preferred. 
6.2 Results 

For data Xi ~ N(£j, l/4rij), a variety of methods are available for estimating (£ l5 . . . , £ n ). 
One is to estimate & with Xi\ the performance of this "naive" procedure is taken as the 
baseline for comparison. Another option is to estimate all £j's with the common sample 



mean X, the group mean. iBrownl (120081 ) describes a number of other, more sophisticated 



Ba yes and empiri c al Ba yes methods. 



Muralidharanl (120101 ) describes a method — called mixfdr — which is based on a finite 
mixture model for the unknown prior distribution. This method can be naturally applied 
directly to the binomial data, the (Y~i, . . . , Y n ), so the transformed data is not needed. In 
this setting, he models the unknown prior f(6) as a finite mixture of beta densities, and 
uses Type II maximum likelihood to estimate the mixture model parameters. 

PR can also be applied to the binomial data directly. The conditions for Theorem [2] 
can readily checked for this binomial problem; see Remark [2J For the initial guess /o, I 
employ some basic knowledge about the context to make an informative choice. In partic- 
ular, for pitchers, who tend to have lower batting averages, I take fo to be a Beta (30, 120) 
distribution, so that the mean is at 0.200. Likewise, for non-pitchers, who typically have 
higher batting averages for pitchers, I take f to be a Beta (30, 90), so that the mean is at 
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0.250. For the weight sequence, consider Wi = (z + l)~ 7 as in Remark[3j If 7 is treated like 
a tuning parameter, we can choose the value of 7 to minimize Brown's prediction error. 
This optimization problem was solved for the pitcher and non-pitcher sets separately, 
giving 7 = 0.5 for pitchers and 7 = 0.9 for non-pitchers. Lastly, the results of the PR 
algorithm are averaged over 100 random permutations of the data to remove dependence 
on the original ordering. 

In Table [U I repeat a portion of Muralidharan's Table 1, together with the correspond- 
ing PR results. The results in the top portion of the table are based on the transformed 
data. Since both PR and mixfdr are applied to the original data, the reported predictions 
used are the posterior means of arcsin y/8~i based on the estimated priors. In this case, 
the PR method is a clear winner when applied to the pitcher portion of the data, and is 
competitive in the non-pitcher portion, beating all methods exce pt mixfdr, including the 
theore tically strong nonparametric empirical Bayes procedure of iBrown and Greenshtein 
( 120091 ). That PR performs well in the smaller-scale pitcher portion of the data suggests 
that it makes more efficient use of the limited information compared to other methods. 
Figure [T] shows both the PR and mixfdr estimates of the prior density f(8) for both 
pitcher and non-pitcher batting averages. In both cases, I would argue that the PR 
estimates are much more realistic than the mixfdr estimates. For pitchers, the mixfdr 
estimate has some peculiar features. That there seems to be two subgroups is itself not 
a concern, but the relative proportions are questionable: among pitchers, there may be a 
relatively small subgroup who are strong hitters, but the plot indicates that a majority 
of pitchers fall in this "extraordinary" group. The PR estimate, on the other hand, is 
smooth and unimodal, with a slight skew to the right indicating a few skillful hitters as 
outliers in this group of pitchers. For the non-pitchers, the support of the mixfdr estimate 
is questionable. Many major league players hit higher than 0.300 on a regular basis, e.g., 
Ichiro Suzuki, arguably one of the best hitters in baseball, has a career batting average of 
0.324, marked by a A in Figure QJb). This value is an extreme outlier under the mixfdr 
estimate, but sits nicely at the tip of the upper tail of the PR estimate. On the other end, 
there are players who consistently hit near 0.200. Typically these players are strong at 
defense to make up for their relatively poor offensive performance. Henry Blanco, whose 
career batting average is 0.228, also marked by a A in Figure [T^b), is one such player. 
Overall, this example suggests that PR-driven nonparametric empirical Bayes gives good 
results in the prediction problem, compared to a variety of methods in both pitcher and 
non-pitcher portions, and can also give a very reasonable picture of the distribution of 
latent hitting abilities. 

One possible extension of the above analysis is to effectively combine the pitcher 
and non-pitcher data together to achieve further sharing of information. Ignoring the 
information contained in the pitcher /non-pitcher label is not an effective approach. One 
possible alternative is to add another parameter to deal with the pitcher/non-pitcher 
information. For example, if Xj = 1 if player i is a pitcher and Xi = otherwise, then 
the model could be modified as follows: Yi\(Xi,9i) ~ Bin(rij, u Xl 9i), i = 1, . . . ,n, where 
oj G (0, 1) is an unknown shrinkage factor describing the overall discount in pitchers' 
hitting ability compared to non-pitchers'. This approa ch can easily be handled within 
the predictive recursion marginal likelihood framework f lMartin and Tokdarll201lf ). but I 
shall omit these details here since it takes us outside the context where PR optimality 
results are available. 
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Pitchers 


Non-pitchers 


Number of training players 


81 


486 


Number of test players 


64 


435 


Naive 


1 


1 


Group mean 


0.127 


0.378 


Parametric EB (MM) 


0.129 


0.387 


Parametric EB (ML) 


0.117 


0.398 


Nonparametric EB 


0.212 


0.372 


James-Stein 


0.164 


0.359 


Hierarchical Bayes 


0.128 


0.391 


mixfdr EB 


0.156 


0.314 


PR-based EB 


0.096 


0.353 



Table 1: Relative predicti on erro r s for v ario us empirical Bayes est imation methods in the 
baseball data example of iBrownl ( 120081 ) and iMuralidharanl (120101 ) . 



7 Discussion 

In this paper I have considered the empirical Bayes approach to statistical inference and 
its implementation via the PR algorithm. In particular, I have shown that PR-based 
empirical Bayes rules are asymptotically optimal under certain conditions. The question 
of asymptotic optimality of empirical Bayes estimation in high-dimensional problems 
where, e.g., the level of sparsity depends on the dimension, is more challenging, and more 
work is needed to establish this for the PR procedure presented herein. However, the 
fact that PR empirically outperforms metho ds (e.g., the nonparametric empirical Bayes 



procedure of iBrown and Greenshteinl (120091 ) appearing in the baseball example above 



which are known to be asymptotically optimal in this strong sense suggests that the PR 
procedure has similar theoretical properties. 

Classical results on empirical Bayes analysis rely heavily on the concept of asymptotic 
E-optimality. I argue that asymptotic optimality in Definition [2] is more meaningful from 
a statistical point of view. In either case, asymptotic optimality is clearly a desirable 
property; but one could certainly argue that asymptotic optimality is not the only qual- 
ity to consider. Robbins and others proposed empirical Bayes rules which were derived 
from, or at least motivated by, asymptotic optimality considerations. These procedures 
often came under criticism since the justification based on asymptotic optimality was 
not convincing and their performance in real applications was unsatisfactory. In this era 
of high-dimensional problems, the sample sizes needed for asymptotic optimality to be 
meaningful in practice are now readily available. I argue that a procedure which is both 
asymptotic optimal and can be justified by other means is most convincing, and here I 
have shown that PR is such a procedure. But when n is large, there are many other 
justifiable alter natives — such as e stima ting F by the method of maximum likelihood or 



the method of iDeely and Krusd (119681 ) — which would also lead to asymptotically opti- 
mal procedures, so what makes PR stand out? Although these alternatives have similar 
asymptotics, in finite samples they typically give estimates of F which are discrete. This 
is clearly inappropriate if vague prior information indicates that F is continuous. Another 
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n 1 1 1 1 1 r n 1 1 1 1 n 

0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.15 0.20 0.25 0.30 0.35 0.40 

e e 

(a) Estimated priors: pitchers (b) Estimated priors: non-pitchers 

Figure 1: Plots of estimates of the prior f{9) based on PR and Muralidharan's mixfdr. 
In panel (b), As mark the career batting averages of Henry Blanco (0.228) and Ichiro 
Suzuki (0.324), respectively (as of 2012). 



i ssue is identifiability. In the "two-groups" problems considered in Martin and Tokdar 
( 120121 ). F is assumed to have both discrete and continuous components. The PR algo- 
rithm can easily handle this type of vague prior information, whereas maximum likelihood 
requires additional assumptions, for example, to identify each component. 
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